Day04 Outliers and Some NumPy Operations

11th iThome Ironman Contest

DAY 4

AI & Data

Hands on Data Cleaning and Scraping series, part 4

Tags: 11th Ironman Contest, outlier, numpy, pandas

kyt

2019-09-05 07:05:51

When collecting data, the measurement method, human error, or experimental error can produce individual values that differ sharply from the rest of the data. These values are outliers. Outliers can make analysis results hard to interpret, so we need to find them, discuss their likely causes with people who have domain knowledge, and handle each one appropriately before moving on to statistical analysis.

Outliers may exist due to:

Placeholder values filled in for unknowns by convention, e.g. age = -1 or 999, phone number = 0900-123-456.
Recording errors, clerical errors, systematic errors, or other mistakes, e.g. a single order listing 1,000 copies of the same book.

Checking for outliers

We can plot the data (histograms, box plots, scatter plots, distribution plots) or inspect its numeric range (the five-number summary, mean, standard deviation, and so on) to find outliers. We should also confirm what every column means as far as possible, so that we can judge why each outlier appeared.
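As a rough illustration of checking numeric ranges, the sketch below prints the five-number summary, mean, and standard deviation for a small made-up sample, then flags values that fall far outside the quartiles. The sample data and the 1.5 × IQR cutoff are assumptions for illustration (a common rule of thumb, not something prescribed by this article). Flagged values are candidates for review, not automatic deletions.

```python
import numpy as np

# Hypothetical sample: books sold per order, with one suspicious entry (1000).
quantities = np.array([1, 2, 2, 3, 1, 4, 2, 1000, 3, 2])

# Five-number summary, plus mean and standard deviation.
minimum, q1, median, q3, maximum = np.quantile(quantities, [0, 0.25, 0.5, 0.75, 1])
print("five-number summary:", minimum, q1, median, q3, maximum)
print("mean:", quantities.mean(), "std:", quantities.std())

# Assumed rule of thumb: flag anything more than 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspects = quantities[(quantities < lower) | (quantities > upper)]
print("values to review:", suspects)
```

Note how the single extreme value inflates the mean and standard deviation while the quartiles stay put, which is why the quartile-based check is used here.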

Dealing with outliers

Outliers need to be handled carefully, combining domain knowledge with statistical methods. First examine the raw data closely and see whether the outlier can be reasonably explained from a domain perspective. If a value is logically impossible, the original record really does say so, and there is no way to verify it with the observed subject, the only option is to delete that observation. If there is no obvious logical error, run the statistical analysis once with the outlier and once without it; if the two results do not contradict each other, the observation can be kept. Depending on the situation, there are three common ways to handle outliers: remove the outlying values, replace them, or drop the whole column. If we decide to replace them, we use the median, the minimum or maximum, or N/A, depending on the case. The commonly used values and the corresponding functions are summarized below:

First, generate some random values and store them in a list.

# Import packages
import random
import numpy as np  # NumPy is conventionally imported under the alias np
from scipy import stats

# Create a list of 20 random integers from 0 to 10.
# random.randint includes both endpoints, and the second argument must not be smaller than the first.
arr = [random.randint(0, 10) for _ in range(20)]
print(arr)

Median

The middle value once all observations are sorted from low to high, separating the higher half from the lower half. If the number of observations is even, the median is the mean of the two middle values.

np.median(arr)
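A small worked example may make the odd/even rule concrete, and also shows why the median is a popular replacement value: one extreme entry barely moves it. The sample lists below are made up for illustration.

```python
import numpy as np

odd = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is 5
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> mean of 3 and 5 is 4
print(np.median(odd))
print(np.median(even))

# Hypothetical ages, then the same ages with a placeholder value of 999 appended.
ages = [23, 25, 27, 29, 31]
ages_with_placeholder = ages + [999]

# The mean jumps from 27 to 189, while the median only moves from 27 to 28.
print(np.mean(ages), np.median(ages))
print(np.mean(ages_with_placeholder), np.median(ages_with_placeholder))
```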

Quantile

Quantiles are cut points that divide the values, sorted from smallest to largest, into equal-sized groups. The parameter 'q' is a value in the range [0, 1] that specifies which quantile to compute, e.g. q=0.25 for the first quartile.

print('Minimum:', np.quantile(arr, q=0))
print('Lower quartile (Q1):', np.quantile(arr, q=0.25))
print('Median (Q2):', np.quantile(arr, q=0.5))
print('Upper quartile (Q3):', np.quantile(arr, q=0.75))
print('Maximum:', np.quantile(arr, q=1))
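To tie these functions back to the replacement strategies described earlier, here is a minimal sketch of swapping outliers for N/A, the median, or the minimum and maximum. The sample ages, the placeholder values (-1 and 999), and the plausibility bounds (0 to 120) are all assumptions for illustration; real bounds should come from domain knowledge.

```python
import numpy as np

# Hypothetical ages; -1 and 999 are assumed placeholders for "unknown".
ages = np.array([23, 35, -1, 41, 999, 29, 52], dtype=float)

# Assumed plausibility bounds for a human age.
valid = (ages >= 0) & (ages <= 120)

# Option 1: mark implausible values as missing (N/A).
as_missing = np.where(valid, ages, np.nan)

# Option 2: replace implausible values with the median of the valid ones.
median_valid = np.median(ages[valid])
as_median = np.where(valid, ages, median_valid)

# Option 3: clip to the smallest and largest valid values.
clipped = np.clip(ages, ages[valid].min(), ages[valid].max())

print(as_missing)
print(as_median)
print(clipped)
```

Which option fits depends on the later analysis: N/A keeps the gap visible, the median keeps the center of the distribution stable, and clipping keeps the direction of the extreme value while limiting its size.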